Fast Failure Recovery in Distributed Graph Processing Systems

Authors

Yanyan Shen

Gang Chen

H. V. Jagadish

Wei Lu

Beng Chin Ooi

Bogdan Marius Tudor

Abstract

Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
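The idea of spreading the lost part of the graph over surviving nodes while accounting for computation and communication cost can be illustrated with a small sketch. This is not the paper's partitioning algorithm, which the abstract does not spell out; it is a simple greedy heuristic chosen for illustration. The function names, the cost dictionaries, and the node and partition labels are all hypothetical.

```python
from typing import Dict, List


def assign_lost_partitions(
    compute_cost: Dict[str, float],
    comm_cost: Dict[str, Dict[str, float]],
    survivors: List[str],
) -> Dict[str, str]:
    """Greedily map each lost partition to a surviving node.

    compute_cost[p]   : assumed cost of recomputing lost partition p
    comm_cost[p][n]   : assumed cost of shipping p's state and messages to node n
    Illustrative heuristic only; not the method from the paper.
    """
    load = {node: 0.0 for node in survivors}
    assignment: Dict[str, str] = {}
    # Place the most expensive partitions first (ties broken by name).
    for part in sorted(compute_cost, key=lambda p: (-compute_cost[p], p)):
        # Pick the node whose load would be smallest after taking this partition.
        best = min(
            survivors,
            key=lambda n: (load[n] + compute_cost[part] + comm_cost[part][n], n),
        )
        load[best] += compute_cost[part] + comm_cost[part][best]
        assignment[part] = best
    return assignment


def recovery_makespan(
    assignment: Dict[str, str],
    compute_cost: Dict[str, float],
    comm_cost: Dict[str, Dict[str, float]],
) -> float:
    """Recovery time if nodes work in parallel: the load of the busiest node."""
    load: Dict[str, float] = {}
    for part, node in assignment.items():
        load[node] = load.get(node, 0.0) + compute_cost[part] + comm_cost[part][node]
    return max(load.values()) if load else 0.0


if __name__ == "__main__":
    # Hypothetical costs for three lost partitions and two surviving nodes.
    compute = {"p0": 4.0, "p1": 3.0, "p2": 2.0}
    comm = {
        "p0": {"n1": 1.0, "n2": 2.0},
        "p1": {"n1": 1.0, "n2": 1.0},
        "p2": {"n1": 2.0, "n2": 0.5},
    }
    plan = assign_lost_partitions(compute, comm, ["n1", "n2"])
    print(plan, recovery_makespan(plan, compute, comm))
```

With these made-up costs, recovering everything on a single node would take the sum of all costs on that node, while the parallel plan is bounded by the busiest node. That gap is the kind of saving a cost-sensitive partitioning aims for.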

Similar resources

Lightweight Fault Tolerance in Large-Scale Distributed Graph Processing

The success of Google’s Pregel framework in distributed graph processing has inspired a surging interest in developing Pregel-like platforms featuring a user-friendly “think like a vertex” programming model. Existing Pregel-like systems support a fault tolerance mechanism called checkpointing, which periodically saves computation states as checkpoints to HDFS, so that when a failure happens, co...

Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage

Large-scale interactive applications and online graph analytic processing require very fast data access to many small data objects. DXRAM addresses these challenges by keeping all data always in memory of potentially many nodes aggregated in a data center. Data loss in case of node failures is prevented by an asynchronous logging on flash disks. In this paper we present the architecture of a no...

A new Shuffled Genetic-based Task Scheduling Algorithm in Heterogeneous Distributed Systems

Distributed systems such as grid and cloud computing provide web services to users all over the world. One of the most important concerns service providers face is managing total cost of ownership (TCO). A large part of TCO is related to power consumption due to inefficient resource management. The task scheduling module, as a key component, can have a drastic impact on both user...

A Protocol for Consistent Checkpointing Recovery for Time-Critical Distributed Database Systems

This paper presents a checkpointing scheme which effectively copes with media failures for a distributed database system (DDBS), which employs the timestamp ordering scheme for concurrency control. In our scheme, normal transactions are executed during the checkpointing process without any interruption. The state of the database taken as a checkpoint by all sites in the system is consistent, so...

Implementation and Performance of Transparent Rollback-recovery in Manetho

We describe the implementation and performance of rollback-recovery in Manetho. During failure-free operation, Manetho maintains an antecedence graph which records the "happened before" relation between certain events in the distributed computation. The antecedence graph is used in combination with checkpointing and volatile sender-based message logging to simultaneously achieve low failure-fre...

Journal:

PVLDB

Volume 8

Publication date: 2014
